Parallel SGD: When does averaging help?
نویسندگان
چکیده
Consider a number of workers running SGD independently on the same pool of data and averaging the models every once in a while — a common but not well understood practice. We study model averaging as a variance-reducing mechanism and describe two ways in which the frequency of averaging affects convergence. For convex objectives, we show the benefit of frequent averaging depends on the gradient variance envelope. For non-convex objectives, we illustrate that this benefit depends on the presence of multiple optimal points. We complement our findings with multicore experiments on both synthetic and real data.
منابع مشابه
Parallelizing Stochastic Approximation Through Mini-Batching and Tail-Averaging
This work characterizes the benefits of averaging techniques widely used in conjunction with stochastic gradient descent (SGD). In particular, this work sharply analyzes: (1) mini-batching, a method of averaging many samples of the gradient to both reduce the variance of a stochastic gradient estimate and for parallelizing SGD and (2) tail-averaging, a method involving averaging the final few i...
متن کاملParallel Dither and Dropout for Regularising Deep Neural Networks
Effective regularisation during training can mean the difference between success and failure for deep neural networks. Recently, dither has been suggested as alternative to dropout for regularisation during batch-averaged stochastic gradient descent (SGD). In this article, we show that these methods fail without batch averaging and we introduce a new, parallel regularisation method that may be ...
متن کاملExperiments on Parallel Training of Deep Neural Network using Model Averaging
In this work we apply model averaging to parallel training of deep neural network (DNN). Parallelization is done in a model averaging manner. Data is partitioned and distributed to different nodes for local model updates, and model averaging across nodes is done every few minibatches. We use multiple GPUs for data parallelization, and Message Passing Interface (MPI) for communication between no...
متن کاملWeighted parallel SGD for distributed unbalanced-workload training system
Stochastic gradient descent (SGD) is a popular stochastic optimization method in machine learning. Traditional parallel SGD algorithms, e.g., SimuParallel SGD [1], often require all nodes to have the same performance or to consume equal quantities of data. However, these requirements are difficult to satisfy when the parallel SGD algorithms run in a heterogeneous computing environment; low-perf...
متن کاملParallel training of Deep Neural Networks with Natural Gradient and Parameter Averaging
We describe the neural-network training framework used in the Kaldi speech recognition toolkit, which is geared towards training DNNs with large amounts of training data using multiple GPU-equipped or multicore machines. In order to be as hardwareagnostic as possible, we needed a way to use multiple machines without generating excessive network traffic. Our method is to average the neural netwo...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- CoRR
دوره abs/1606.07365 شماره
صفحات -
تاریخ انتشار 2016